Théo Druilhe, Pierre Larose, Nathan Pizzetta, Sigurd Saue
Context
In this project, the objective is to develop a predictive model for corporate bankruptcy, utilizing historical financial and default data. Companies may encounter various forms of financial distress, such as missed payments, distressed exchanges, or formal bankruptcy proceedings like Chapter 7 and Chapter 11 filings. By analyzing these events and their associated financial indicators, this project aims to build a dataset that captures each company’s fiscal health over time and leverages machine learning to forecast the likelihood of default.
The work involves combining multiple datasets—Compustat, LoPucki, and Moody’s—which each track different aspects of corporate financial distress. After careful preprocessing to align and clean the data, the most recent financial and default indicators for each company are extracted. These serve as the target variable \(Y\) for bankruptcy risk, which will later be matched with a comprehensive feature set \(X\) to train the predictive model. This analysis aims to provide insights into the patterns that precede default and to contribute a robust tool for assessing bankruptcy risk in real-world scenarios.
Building our dataset
In this section we will focus on building our target variable \(Y\) for bankruptcy prediction and our feature set \(X\) for training the predictive model.
Building Y - Bankruptcy Data
To build our target variable we will go through the following steps:
Extracting the relevant data from the Compustat, LoPucki, and Moody’s datasets.
Merging these datasets to create a comprehensive dataset that captures the financial health and default history of each company.
Creating the target variable \(Y\) based on bankruptcy events within 1 year of each fiscal year.
Selecting the most recent fiscal year for each company to use as the target variable dataset.
The comp.company database contains some information for a wide range of companies: address, phone number, website url, … For our analysis we are interested in the deletion date and the deletion reason which gives us potential information about companies that could have been bankrupted. To do so, we will extract the company name, the key, the reason for deletion, and the deletion date.
WRDS Connection
Example of code to connect to WRDS database and extract data from Compustat database
Code
library(tidyverse)library(dbplyr)library(RPostgres)wrds <-dbConnect(Postgres(),host ="wrds-pgdata.wharton.upenn.edu",dbname ="wrds",port =9737,sslmode ="require",user =Sys.getenv("wrds_user"),password =Sys.getenv("wrds_password"))compustat_list <-dbListObjects(wrds, Id(schema ="comp"))# Use dplyr verbs with a remote database table# https://dbplyr.tidyverse.org/reference/tbl.src_dbi.htmlfunda_db <-tbl(wrds, in_schema("comp", "funda"))funda_db %>%filter(grepl('APPLE INC', conm)) %>%select(gvkey, fyear, conm, at, wcap, re, ebit, lt, sale) %>%mutate(WCTA = wcap / at,RETA = re / at,EBTA = ebit / at,TLTA = lt / at, # as a proxy for ME/TLSLTA = sale / at)
Code
library(dplyr)library(lubridate)source("../scripts/helper_functions.R")# Construct the file path in a way that works on both macOS and Windowsfile_path <-file.path("..", "data", "wrds_data", "company_all.rds")# Check if the file exists and read the Excel fileif (file.exists(file_path)) { company_db <-readRDS(file_path)} else {print("File not found. Please check the file path.")}# Extract relevant company informationcompustat_company_data <- company_db %>%filter(is.na(dldte) | (year(dldte) >="2011")) %>%select(gvkey, conm, dlrsn, dldte) %>%collect()styled_dt(compustat_company_data)
This dataset is key because it links the fiscal year with the financial data. To use it porperly we will try to keep all the companies where we can be confident if they were bankrupt or not. The objective of predicting a company bankruptcy on the next year relies on the ongoing fiscal year, it is therefore necessary to have it for all our companies in our dataset.
Code
# Construct the file path in a way that works on both macOS and Windowsfile_path <-file.path("..", "data", "wrds_data", "compustat_all.rds")# Check if the file exists and read the Excel fileif (file.exists(file_path)) { compustat_all <-readRDS(file_path)} else {print("File not found. Please check the file path.")}# Select relevant financial metrics and compute financial ratiosfiscal_year <- compustat_all %>%filter(fyear >="2010") %>%select(gvkey, fyear)fy_size <- fiscal_year %>%group_by(gvkey) %>%filter(fyear==max(fyear)) %>% dimcat("Number of companies within this period of time :", fy_size[1])
Number of companies within this period of time : 21758
Lopucki database includes approximately one thousand large public company bankruptcies filed in the United States Bankruptcy Courts under Chapter 7 or Chapter 11 since October 1, 1979 to December 11, 2022.
Chapter 7 : liquidation
Chapter 11 : reorganization
We will extract the company name, the key, the filling date, and the chapter.
Code
# Construct the file path in a way that works on both macOS and Windowsfile_path <-file.path("..","data", "default_data", "Bankruptcy - LoPucki", "Florida-UCLA-LoPucki Bankruptcy Research Database 1-12-2023.xlsx")# Check if the file exists and read the Excel fileif (file.exists(file_path)) { lopucki_db <- readxl::read_xlsx(file_path)} else {print("File not found. Please check the file path.")}# Filter for Chapter 7 (liquidation) and Chapter 11 (reorganization) bankruptcieslopucki_clean <- lopucki_db %>%filter((Chapter %in%c("7", "11")) & (Disposition !="Chapter 7 at filing") & (year(DateFiled) >="2011")) %>%# Exclude Chapter 7 at filing as advised in the do file of the datasetselect(GvkeyBefore, NameCorp, DateFiled, Chapter) %>%group_by(GvkeyBefore) %>%summarize(DateFiled =min(DateFiled),NameCorp = NameCorp[which.min(DateFiled)],Chapter = Chapter[which.min(DateFiled)]) %>%mutate(DateFiled = lubridate::as_date(DateFiled)) %>%ungroup()styled_dt(lopucki_clean)
Moody’s Annual Default Reports for Corporates provide detailed analysis of corporate bond defaults and credit rating changes. These reports track default rates, recovery rates, and rating transitions for companies around the world.
Data was extracted from Moody’s Annual reports covering Corporate Default starting in 2002, it is more heterogeneous than LoPucki and not directly linkable to Compustat since only the name of the company is reported, also depending on the year, either full date, the month or nothing is reported regarding the default/bankruptcy event.
What is interesting for us with this dataset, is that a lot of companies are deleted from comp.company for a different reason than credit (Acquisitions, Other) but are in fact defaulting companies in Moody’s dataset.
Code
# Construct the file path in a way that works on both macOS and Windowsfile_path <-file.path("..","data", "default_data", "dat_default_moodys_annual.rds")# Check if the file exists and read the Excel fileif (file.exists(file_path)) { moody_db <-readRDS(file_path)} else {print("File not found. Please check the file path.")}moody_clean <- moody_db %>%filter(year(default_date) >="2011") %>%select(company_name, default_type, default_date)# Inspect the best matchesstyled_dt(moody_clean)
Merging Datasets
Compustat and LoPucki
To merge the Compustat and LoPucki datasets, we will use the gvkey (Compustat) and GvkeyBefore (LoPucki) columns as the key for merging.
Code
# Merge Compustat and LoPucki data based on gvkey (Compustat) and GvkeyBefore (LoPucki)comp_lopucki <- compustat_company_data %>%left_join(lopucki_clean, by =c("gvkey"="GvkeyBefore"))
Comp_lopucki with Moody
To manage to merge the Moody’s dataset with the Compustat dataset, we need to standardize the company names in both datasets. We will remove punctuation, whitespace, and common company suffixes to create a common key for merging.
Code
# Standardize company names in both datasetstrim_pattern <-",|\\.|INC|LLC|CORPORATION|CORP|COMPANY|\\*|\\s+"# Clean Moody's datamoody_clean <- moody_clean %>%mutate(compact_name = stringr::str_to_upper(company_name),compact_name = stringr::str_trim(gsub(trim_pattern, "", compact_name)) )# Clean Compustat company datacomp_lopucki <- comp_lopucki %>%mutate(compact_name = stringr::str_trim(gsub(trim_pattern, "", conm)) )comp_lopucki_moody <- comp_lopucki %>%left_join(moody_clean, by =c("compact_name")) %>%filter(!is.na(gvkey)) %>%select(gvkey, conm, dlrsn, dldte, DateFiled, Chapter, default_date, default_type)
Comp_lopucki_moody with Fiscal Year
Here, we will merge the combined dataset with the fiscal year data to create a comprehensive dataset that captures the financial health and default history of each company.
The Y variable will be created based on bankruptcy events within 1 year of each fiscal year, information like Chapter 7 and Chapter 11 bankruptcies from LoPucki, Moody’s default and Computstat deletion date and reason.
Code
# Merge financial data with the complete bankruptcy databankruptcy_data <-fiscal_year %>%left_join(comp_lopucki_moody, by ="gvkey") %>%mutate(# Define target variable Y based on bankruptcy events within 1 year of each fiscal yearY =ifelse(# Compustat bankruptcy within 1 year after fiscal year (!is.na(dldte) & dlrsn %in%c("02", "03") & (fyear +1==year(dldte))) |# LoPucki bankruptcy within 1 year after fiscal year (!is.na(DateFiled) & (fyear +1==year(DateFiled))) |# Moody's default within 1 year after fiscal year (!is.na(default_date) & (grepl("Bankruptcy", default_type, ignore.case =TRUE) |grepl("Chapter 11", default_type, ignore.case =TRUE)) & (fyear +1==year(default_date))), 1, 0 ) )
Cleaning Bankruptcy Dataset
In our dataset, some companies went bankrupt at some year. But then, they are still in the dataset for the following years. For these companies, we need to delete the years following the bankruptcy year.
Code
# Identify bankruptcy years for each companybankrupt_years <- bankruptcy_data %>%filter(Y ==1) %>%select(gvkey, bankruptcy_fyear = fyear)# Join bankruptcy years back to the original databankruptcy_data_cleaned <- bankruptcy_data %>%left_join(bankrupt_years, by ="gvkey") %>%# Keep rows where either there is no bankruptcy year or fyear is less than or equal to bankruptcy yearfilter(is.na(bankruptcy_fyear) | fyear <= bankruptcy_fyear) %>%select(-bankruptcy_fyear)
We already have well specified the companies that went bankrupt, but we need to clean the dataset to keep only the most relevant “non bankrupt” companies.
To do so, we will look at the reason of deletion for the companies that are not bankrupt and delete the companies with a deletion reason that cannot tells us if they went bankrupt.
For example, a company deleted for the reason “Acquisition or merger” should be keep in the dataset since it is not a bankruptcy reason. But, a company
Code
# Remove rows for companies with no bankruptcies and specific dlrsn valuesbankruptcy_data_cleaned <- bankruptcy_data_cleaned %>%group_by(gvkey) %>%filter(!(sum(Y) ==0& dlrsn %in%c(05, 10, 07))) %>%ungroup()
Extracting target Y from Bankruptcy Data
Finally, we will extract the most recent fiscal year for each company to use as the target variable dataset. The objective is to keep only the most recent information for each company to predict bankruptcy risk.